Abstract. Graphs and networks provide a canonical representation of
relational data, with massive network data sets becoming increasingly
prevalent across a variety of scientific fields. Although tools from
mathematics and computer science have been eagerly adopted by
practitioners in the service of network inference, they do not yet
comprise a unified and coherent framework for the statistical analysis
of large-scale network data. This paper serves as both an introduction
to the topic and a first step toward formal inference procedures. We
develop and illustrate our arguments using the example of hypothesis
testing for network structure. We invoke a generalized likelihood ratio
framework and use it to highlight the growing number of topics in this
area that require strong contributions from statistical science. We
frame our discussion in the context of previous work from across a
variety of disciplines, and conclude by outlining fundamental
statistical challenges whose solutions will in turn serve to advance
the science of network
inference.

1. INTRODUCTION

Graphs and networks have long been a subject of significant
mathematical and scientific interest, deemed worthy of study for their
own sake and often associated with scientific data. However, a diverse
and rapidly growing set of contemporary applications is fast giving
rise to massive networks that themselves comprise the data set of
interest, and to analyze these network data, practitioners in turn
require analogs to classical inferential procedures.

While past decades have witnessed a variety of advances in the
treatment of graphs and networks as combinatoric or algebraic objects,
corresponding advances in formal data analysis have largely failed to
keep pace. Indeed, the development of a successful framework for the
statistical analysis of network data requires the repurposing of
existing models and algorithms for the specific purpose of inference.
In this paper, we pose the question of how modern statistical science
can best rise to this challenge as well as benefit from the many
opportunities it presents. We provide first steps toward formal network
inference procedures through the introduction of new tests for network
structure, and employ concrete examples throughout that serve to
highlight the need for additional research contributions in this
burgeoning area.

1.1 Modern Network Data Sets

Though once primarily the domain of social scientists, a view of
networks as data objects is now of interest to researchers in areas
spanning biology, finance, engineering, and library science, among
others. Newman (2003) provides an extensive review of modern network
data sets; other examples of note include mobile phone records, which
link customers according to their phone calls (Eagle et al., 2008); the
internet, including both web pages connected by hyperlinks (Adamic and
Huberman, 2000) and peer-to-peer networks (Stutzbach et al., 2006);
electrical power networks, in which local grids are physically
connected by long-distance transmission lines (Watts and Strogatz,
1998); and publication networks, where citations provide explicit links
between authors (de Solla Price, 1965).

At the same time, other scientific fields are beginning to reinterpret
traditional data sets as networks, in order to better understand,
summarize, and visualize relationships amongst very large numbers of
observations. Examples include protein-protein interaction networks,
with isolated pairs of proteins deemed connected if an experiment
suggests that they interact (Batada et al., 2006); online financial
transactions, whereupon items are considered to be linked if they are
typically purchased together (Jin et al., 2007); food webs, with
species linked by predator-prey relationships (Dunne et al., 2002); and
spatial data sets (Thompson, 2006; Ceyhan et al., 2007).


1.2 Organization and Aims of the Paper 

The above examples attest both to the wide variety of networks
encountered in contemporary applications and to the multiple expanding
literatures on their analysis. In this paper, we focus on introducing
the subject from first principles and framing key inferential
questions. We begin with a discussion of relational data in Section 2
and introduce notation to make the connection to networks precise. We
discuss model specification and inference in Section 3 by way of
concrete definitions and examples. We introduce new ideas for detecting
network structure in Section 4 and apply them to data analysis by way
of formal testing procedures. In Section 5 we discuss open problems and
future challenges for large-scale network inference in the key areas of
model elicitation, approximate fitting procedures, and issues of data
sampling. In a concluding appendix we provide a more thorough
introduction to the current literature, highlighting contributions to
the field from statistics as well as a variety of other disciplines.

2. NETWORKS AS RELATIONAL DATA 

We begin our analysis by making explicit the 
connection between networks and relational data. 
In contrast to data sets that may arise from
pairwise distances or affinities of points in space or 
time, many modern network data sets are massive, 
high dimensional, and non-Euclidean in their struc- 
ture. We therefore require a way to describe these 
data other than through purely pictorial or tab- 
ular representations — and the notion of cataloging 
the pairwise relationships that comprise them, with 
which we begin our analysis, is natural. 

2.1 Relational Data Matrices and Covariates 

Graphs provide a canonical representation of relational data as
follows: Given $n$ entities or objects of interest with pairs indexed
by $(i, j)$, we write $i \sim j$ if the $i$th and $j$th entities are
related, and $i \nsim j$ otherwise. These assignments may be expressed
by way of an $n \times n$ adjacency matrix $A$, whose entries
$\{A_{ij}\}$ are nonzero if and only if $i \sim j$. While both the
structure of $A$ and the field over which its entries are defined
depend on the application or specific data set, a natural connection to
graph theory emerges in which entities are represented by vertices, and
relations by edges; we adopt the informal but more suggestive
descriptors "node" and "link," respectively. The degree of the $i$th
node is in turn defined as $\sum_{j=1}^{n} A_{ij}$.

In addition, the data matrix $A$ is often accompanied by covariates
$c(i)$ associated with each node, $i \in \{1, 2, \ldots, n\}$.
Example 2.1 below illustrates a case in which these covariates take the
form of binary categorical variables. We shall refer back to these
illustrative data throughout Sections 2 and 3, and later in Section 4
will consider a related real-world example: the social network recorded
by Zachary (1977), in which nodes represent members of a collegiate
karate club and links represent friendships, with covariates indicating
a subsequent split of the club into two disjoint groups.

Example 2.1 (Network Data Set). As an example data set, consider the
10-node network defined by data matrix $A$ and covariate vector $c$ as
follows.

A visualization of the corresponding network is shown in Fig. 1;
however, note that as no geometric structure is implied by the data set
itself, a pictorial rendering such as this is arbitrary and non-unique.

Fig 1. The network data of Example 2.1, with nodes indexed by number
and binary categorical covariate values by shape. Note that no
Euclidean embedding accompanies the data, making visualization a
challenging task for large-scale networks.



In Example 2.1, categorical covariates $c(i)$, $i \in \{1, 2, \ldots,
n\}$, are given; however, in almost all network data sets of practical
interest, these covariates are latent. This in turn gives rise to many
of the principal questions of network inference, in contrast to the
traditional setting of relational data. Therefore, the issues of
network modeling which arise tend to be distinct; as such, classical
approaches (e.g., contingency table tests) are directly applicable to
network data only in very restricted circumstances.

2.2 Networks as Distinct From Relational Data 

The main distinction between modern-day network data and classical
relational data lies in the requisite computational complexity for
inference. Indeed, the computational requirements of large-scale
network data sets are substantial. With $n$ nodes we associate
$\binom{n}{2} = n(n-1)/2$ symmetric relations; beyond this quadratic
scaling, latent covariates give rise to a variety of combinatorial
expressions in $n$. Viewed in this light, methods to determine
relationships amongst subsets of nodes can serve as an important tool
to "coarsen" network data. In addition to providing a
lower-dimensional summary of the data, such methods can serve to
increase the computational efficiency of subsequent inference
procedures by enabling data reduction and smoothing. The general
approach is thus similar to modern techniques for high-dimensional
Euclidean data, and indeed may be viewed as a clustering of nodes into
groups.

From a statistical viewpoint, this notion of subset 
relations can be conveniently described by a k-ary
categorical covariate, with k specifying the (potentially
latent) model order. By incorporating such a
covariate into the probability model for the data ad- 
jacency matrix A, the "structure" of the network 
can be directly tested if this covariate is observed, 
or instead inferred if latent. It is easily seen that 
the cardinality of the resultant model space is ex- 
ponential in the number of nodes n; even if the 
category sizes themselves are given, we may still 
face a combinatorial inference problem. Thus, even 
a straightforwardly-posed hypothesis test for a rel- 
atively simple model can easily lead to cases where 
exact inference procedures are intractable. 
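
To give a sense of the scaling just described, the following Python sketch counts the pairwise relations and the candidate covariate vectors for a few network sizes. The function names are hypothetical and introduced only for illustration; labelings are counted with group labels treated as distinct.

```python
from math import comb


def num_pairs(n: int) -> int:
    """Number of unordered node pairs, n(n-1)/2."""
    return comb(n, 2)


def num_labelings(n: int, k: int) -> int:
    """Number of k-ary covariate vectors c on n nodes (labels distinct)."""
    return k ** n


def num_balanced_bipartitions(n: int) -> int:
    """Number of binary labelings with exactly n/2 nodes in each group."""
    if n % 2:
        raise ValueError("n must be even")
    return comb(n, n // 2)


# Quadratic growth in pairs versus exponential growth in labelings.
for n in (10, 34, 100):
    print(n, num_pairs(n), num_labelings(n, 2), num_balanced_bipartitions(n))
```

Even with both group sizes fixed at n/2, the number of candidate assignments grows far faster than the number of pairs, which is the combinatorial difficulty referred to above.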

3. MODEL SPECIFICATION AND INFERENCE 

Fields such as probability, graph theory, and computer science have
each posited specific models which can be applied to network data;
however, when appealing to the existing literature, it is often the
case that neither the models nor the analysis tools put forward in
these contexts have been developed specifically for inference. In this
section, we introduce two basic network models and relate them to
classical statistics. The first such model consists of nodes that are
probabilistically exchangeable, whereas the second implies group
structure via latent categorical covariates. Inferring relationships
amongst groups of nodes from data in turn requires the standard tools
of statistics, including parameter estimation and hypothesis testing.
We provide examples of such procedures below, illustrating their
computational complexity, and introduce corresponding notions of
approximate inference.

3.1 Erdős-Rényi: A First Illustrative Example

We begin by considering one of the simplest possible models from random
graph theory, attributed to Erdős and Rényi (1959) and Gilbert (1959),
and consisting of pairwise links that are generated independently with
probability $p$. Under this model, all nodes are exchangeable; it is
hence appropriate to describe instances in which no group structure (by
way of categorical covariates) is present. In turn, we shall contrast
this with an explicit model for structure below.

Adapted to the task of modeling undirected network data, the
Erdős-Rényi model may be expressed as a sequence of $\binom{n}{2}$
Bernoulli trials corresponding to off-diagonal elements of the
adjacency matrix $A$.

Definition 1 (Erdős-Rényi Model). Let $n > 1$ be integral and fix some
$p \in [0, 1]$. The Erdős-Rényi random graph model corresponds to
matrices $A \in \{0,1\}^{n \times n}$ defined element-wise as

$$\forall i, j \in \{1, 2, \ldots, n\} : i < j, \quad A_{ij} \sim \mathrm{Bernoulli}(p); \qquad A_{ji} = A_{ij}, \quad A_{ii} = 0.$$

Erdős-Rényi thus provides a one-parameter model yielding independent
and identically distributed binary random variables representing the
absence or presence of pairwise links between nodes; as this binary
relation is symmetric, we take $A_{ji} = A_{ij}$.

The additional stipulation $A_{ii} = 0$ for all $i$ implies that our
relation is also irreflexive; in the language of graph theory, the
corresponding (undirected, unweighted) graph is said to be simple, as
it exhibits neither multiple edges nor self-loops. The event
$i \sim j$ is thus a Bernoulli($p$) random variable for all
$i \neq j$, and it follows that the degree $\sum_{j=1}^{n} A_{ij}$ of
each network node is a Binomial($n-1$, $p$) random variable.
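
As an illustration of Definition 1, the sketch below draws one adjacency matrix from the Erdős-Rényi model using NumPy. The function name `sample_erdos_renyi`, the seed, and the chosen values of n and p are assumptions made for this example only.

```python
import numpy as np


def sample_erdos_renyi(n, p, rng):
    """Draw a symmetric 0/1 adjacency matrix with zero diagonal."""
    draws = rng.random((n, n)) < p           # independent Bernoulli(p) draws
    upper = np.triu(draws, k=1).astype(int)  # keep only entries with i < j
    return upper + upper.T                   # set A_ji = A_ij; A_ii stays 0


rng = np.random.default_rng(0)
A = sample_erdos_renyi(10, 14 / 45, rng)
degrees = A.sum(axis=1)  # each degree is marginally Binomial(n - 1, p)
print(A)
print(degrees)
```

Only the strict upper triangle is sampled, so exactly n(n-1)/2 Bernoulli trials determine the graph.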

Fitting the parameter $p$ is straightforward; the maximum likelihood
estimator (MLE) corresponds to the sample proportion of observed links:

$$\hat{p} = \frac{1}{\binom{n}{2}} \sum_{i<j} A_{ij} = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij}.$$

Example 2.1, for instance, yields $\hat{p} = 14/45$.
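
The estimator can be computed directly from an adjacency matrix. The following is a minimal sketch; the function name and the small four-node example graph are hypothetical and are not the data of Example 2.1.

```python
import numpy as np


def erdos_renyi_mle(A):
    """Sample proportion of observed links in a simple undirected graph."""
    A = np.asarray(A)
    n = A.shape[0]
    # Each link appears twice in the symmetric matrix, hence n(n-1).
    return A.sum() / (n * (n - 1))


# Hypothetical example: a 4-cycle has 4 links among 6 possible pairs.
cycle = np.array([[0, 1, 0, 1],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [1, 0, 1, 0]])
p_hat = erdos_renyi_mle(cycle)
print(p_hat)
```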

Given a relational data set of interest, we can test 
the agreement of data in A with this model by em- 
ploying an appropriately selected test statistic. If we 
wish to test this uniformly generic model with re- 
spect to the notion of network structure, we may 
explicitly define an alternate model and appeal to 
the classical Neyman-Pearson testing framework. 

In this vein, the Erdős-Rényi model can be generalized in a natural way
to capture the notion of local rather than global exchangeability: we
simply allow Bernoulli parameters to depend on $k$-ary categorical
covariates $c(i)$ associated with each node $i \in \{1, 2, \ldots,
n\}$, where the $k < n$ categories represent groupings of nodes.
Formally we define

$$c \in \{0, 1, \ldots, k-1\}^{n}; \qquad c(i) : \{1, 2, \ldots, n\} \to \{0, 1, \ldots, k-1\},$$

and a set of $\binom{k+1}{2}$ distinct Bernoulli parameters governing
link probabilities within and between these categories, arranged into a
$k \times k$ symmetric matrix and indexed as $p_{c(i)c(j)}$ for
$i, j \in \{1, 2, \ldots, n\}$.

In the case of binary categorical covariates, we immediately obtain a
formulation of Holland and Leinhardt (1981), the simplest example of a
so-called stochastic block model. In this network model, pairwise links
between nodes correspond again to Bernoulli trials, but with a
parameter chosen from the set $\{p_{00}, p_{01}, p_{11}\}$ according to
binary categorical covariates associated with the nodes in question.

Definition 2 (Simple Stochastic Block Model). Let $c \in \{0, 1\}^{n}$
be a binary $n$-vector for some integer $n > 1$, and fix parameters
$p_{00}, p_{01}, p_{11} \in [0, 1]$. Set $p_{10} = p_{01}$; the model
then corresponds to matrices $A \in \{0,1\}^{n \times n}$ defined
element-wise as

$$\forall i, j \in \{1, 2, \ldots, n\} : i < j, \quad A_{ij} \sim \mathrm{Bernoulli}(p_{c(i)c(j)}); \qquad A_{ji} = A_{ij}, \quad A_{ii} = 0.$$
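
Definition 2 can be simulated in the same way as the Erdős-Rényi model, with the link probability looked up from the covariates of each pair. This is a sketch under the assumption that labels are coded 0 and 1; the function name and the small test configuration are illustrative only.

```python
import numpy as np


def sample_block_model(c, p00, p01, p11, rng):
    """Draw an adjacency matrix from the two-group stochastic block model."""
    c = np.asarray(c)
    P = np.array([[p00, p01],
                  [p01, p11]])               # symmetric parameter matrix
    probs = P[c[:, None], c[None, :]]        # n x n matrix of p_{c(i)c(j)}
    draws = rng.random(probs.shape) < probs  # Bernoulli draw per pair
    upper = np.triu(draws, k=1).astype(int)  # keep only entries with i < j
    return upper + upper.T


rng = np.random.default_rng(1)
c = np.array([0] * 5 + [1] * 5)
A = sample_block_model(c, 0.5, 0.0, 0.5, rng)
print(A)
```

With the between-group parameter set to zero, as here, no links can appear between the two groups.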



If the vector of covariates $c$ is given, then finding the
maximum-likelihood parameter estimates $\{\hat{p}_{00}, \hat{p}_{01},
\hat{p}_{11}\}$ is trivial after a re-ordering of nodes via permutation
similarity: For any $n \times n$ permutation matrix $\Pi$, the
adjacency matrices $A$ and $\Pi A \Pi'$ represent isomorphic graphs,
the latter featuring permuted rows and columns of the former. If $\Pi$
re-indexes nodes according to their categorical groupings, then we may
define a conformal partition

$$\Pi A \Pi' = \begin{bmatrix} A_{00} & A_{01} \\ A_{01}' & A_{11} \end{bmatrix}$$

that respects this ordering, such that exchangeability is preserved
within, but not across, submatrices $A_{00}$ and $A_{11}$. We may then
simply compute sample proportions corresponding to each submatrix
$\{A_{00}, A_{01}, A_{11}\}$ to yield $\{\hat{p}_{00}, \hat{p}_{01},
\hat{p}_{11}\}$.
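
The re-ordering and block-wise sample proportions can be sketched as follows. The helper name `block_mle` and the four-node example are hypothetical, and the sketch assumes each group contains at least two nodes.

```python
import numpy as np


def block_mle(A, c):
    """Block-wise link proportions for known binary covariates c."""
    A, c = np.asarray(A), np.asarray(c)
    order = np.argsort(c, kind="stable")  # permutation listing group 0 first
    B = A[np.ix_(order, order)]           # permutation-similar matrix
    n0 = int((c == 0).sum())
    n1 = len(c) - n0
    A00, A01, A11 = B[:n0, :n0], B[:n0, n0:], B[n0:, n0:]
    p00 = A00.sum() / (n0 * (n0 - 1))     # symmetric block: links counted twice
    p01 = A01.sum() / (n0 * n1)           # off-diagonal block: counted once
    p11 = A11.sum() / (n1 * (n1 - 1))
    return p00, p01, p11


# Hypothetical 4-node example with links 0-2, 1-3 and 0-1.
A = np.zeros((4, 4), dtype=int)
for i, j in [(0, 2), (1, 3), (0, 1)]:
    A[i, j] = A[j, i] = 1
c = np.array([0, 1, 0, 1])
print(block_mle(A, c))
```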

Note that by construction, submatrices $A_{00}$ and $A_{11}$ yield
subgraphs that are themselves Erdős-Rényi, and are said to be induced
by the two respective groups of categorical covariates. Nonzero entries
of $A_{01}$ are said to comprise the edge boundary between these two
induced subgraphs; indeed, the matrix obtained by setting all entries
of $A_{00}$ and $A_{11}$ to zero yields in turn a bipartite graph whose
vertices can be partitioned according to their binary covariate values.

The following example illustrates these concepts using the simulated
data of Example 2.1.

Example 3.1 (Similarity and Subgraphs). Let the 10-node network of
Example 2.1 be subject to an isomorphism that re-orders nodes according
to the two groups defined by their binary covariate values, and define
the permuted covariate vector $c$ and permutation-similar data matrix
$A$ as follows:

Figure 2 illustrates the corresponding subgraphs using the
visualization of Fig. 1; assuming a simple stochastic block model in
turn leads to the following maximum-likelihood parameter estimates:

Fig 2. Subgraphs based on the binary covariates of Example 2.1, again
represented graphically by node shape. The conformal partition of
Example 3.1 implies two induced subgraphs: solid lines inside the
ellipse are links represented in submatrix $A_{00}$, while those
outside it appear as links in $A_{11}$. The remaining links, shown as
dashed lines, correspond to values of 1 in submatrix $A_{01}$ and
comprise the associated edge boundary.

Example 3.1 illustrates the ease of model fitting when binary-valued
covariates are known; the notion of permutation similarity plays a
similar role in the case of $k$-ary covariates.

3.2 Approximate Inference 

The careful reader will have noted that in the case of known
categorical covariates, examples such as those above can be expressed
as contingency tables (a notion we revisit in Section 4 below) and
hence may admit exact inference procedures. However, if covariates are
latent, then an appeal to maximum-likelihood estimation induces a
combinatorial optimization problem; in general, no fast algorithm is
known for likelihood maximization over the set of covariates and
Bernoulli parameters under the general $k$-group stochastic block
model.

The principal difficulty arises in maximizing the likelihood over the
$n$-dimensional $k$-ary covariate vector $c$, which ranges over an
exponentially large model space; estimating the $\binom{k+1}{2}$
associated Bernoulli parameters then proceeds in exact analogy to
Example 3.1 above. The following example illustrates the complexity of
this inference task.

Example 3.2 (Permutation and Maximization). Consider a 100-node network
generated according to the stochastic block model of Definition 2, with
each group of size 50 and $p_{00} = p_{11} = 1/2$, $p_{01} = 0$.
Figure 3 shows two permutation-similar adjacency matrices, $A$ and
$\Pi A \Pi'$, that correspond to isomorphic graphs representing this
network; inferring the vector $c$ of binary categorical covariates from
data $A$ in Fig. 3(a) is equivalent to finding a permutation similarity
transformation $\Pi A \Pi'$ that reveals the distinct division apparent
in Fig. 3(b).

Fig 3. Representations $A$ and $\Pi A \Pi'$ of data drawn from the
stochastic block model of Example 3.2, corresponding to isomorphic
graphs (black boxes denote links): (a) simulated data matrix $A$;
(b) transformed data $\Pi A \Pi'$. Though $p_{01} = 0$, only a small
subset of permutation similarity transformations $\Pi(\cdot)\Pi'$ will
reveal the disconnected nature of this network.

Given the combinatorial nature of this problem, it is clear that
fitting models to real-world network data can quickly necessitate
approximate inference. To this end, Example 3.2 motivates an important
means of exploiting algebraic properties of network adjacency
structure: the notion of a graph spectrum. Eigenvalues associated with
graphs reveal several of their key properties (Chung, 1997) at a
computational cost that scales as the cube of the number of nodes,
offering an appealing alternative in cases where exact solutions are of
exponential complexity. As the adjacency matrix $A$ itself fails to be
positive semidefinite, the spectrum of a labeled graph is typically
defined via a Laplacian matrix $L$ as follows.

Definition 3 (Graph Laplacian). Let $i \sim j$ denote a symmetric
adjacency relation defined on an $n$-node network. An associated
$n \times n$ symmetric, positive-semidefinite matrix $L$ is called a
graph Laplacian if, for all $i, j \in \{1, 2, \ldots, n\} : i \neq j$,
we have

$$L_{ij} < 0 \ \text{ if } i \sim j, \qquad L_{ij} = 0 \ \text{ if } i \nsim j.$$

Note that the diagonal of $L$ is defined only implicitly, via the
requirement of positive-semidefiniteness; a typical diagonally-dominant
completion termed the combinatorial Laplacian takes $L = D - A$, where
$D$ is a diagonal matrix of node degrees such that
$D_{ii} = \sum_{j=1}^{n} A_{ij}$. An important result is that the
dimension of the kernel of $L$ is equal to the number of connected
components of the corresponding graph; hence $p_{01} = 0$ implies in
turn that at least two eigenvalues of $L$ will be zero in Example 3.2
above.

Correspondingly, Fiedler (1973) termed the second-smallest eigenvalue
of the combinatorial Laplacian the algebraic connectivity of a graph,
and recognized that positive and negative entries of the corresponding
eigenvector (the "Fiedler vector") define a partition of nodes that
nearly minimizes the number of edge removals needed to disconnect a
network. In fact, in the extreme case of two equally sized,
disconnected subgraphs, as given by Example 3.2, this procedure exactly
maximizes the likelihood of the data under a two-group stochastic block
model; more generally, it provides a means of approximate inference
that we shall return to in Section 4 below.

As reviewed by von Luxburg (2007), the observation of Fiedler was later
formalized as an algorithm termed spectral bisection (Pothen et al.,
1990), and indeed leads to the more general notion of spectral
clustering (von Luxburg et al., 2008). This remains an active area of
research in combinatorics and theoretical computer science, where a
simple stochastic block model with $p_{00}, p_{11} > p_{01}$ is termed
a "planted partition" model (Bollobás and Scott, 2004).


4. TESTING FOR NETWORK STRUCTURE 

Identifying some degree of structure within a network data set is an
important prerequisite to formal statistical analysis. Indeed, if all
nodes of a network are truly unique and do not admit any notion of
grouping, then the corresponding data set, no matter how large, is
really only a single observation. On the other hand, if every node can
be considered independent and exchangeable under an assumed model, then
depicting the data set as a network is unhelpful: the data are best
summarized as n independent observations of nodes whose connectivity
structure is uninformative.

In this section we invoke a formal hypothesis testing framework to
explore the notion of detecting network structure in greater detail,
and propose new approaches that are natural from a statistical point of
view but have thus far failed to appear in the literature. To
illustrate these ideas we apply three categories of tests to a single
data set (that of Section 4.1 below) and in turn highlight a number of
important topics for further development.

4.1 The Zachary Karate Data

Zachary (1977) recorded friendships between 34 members of a collegiate
karate club that subsequently split into two groups of size 16 and 18.
These data are shown in Fig. 4, with inter- and intra-group links given
in Table 1. The network consists of 78 links, with degree sequence
(ordered in accordance with the node numbering of Fig. 4) given by

(16,9,10,6,3,4,4,4,5,2,3,1,2,5,2,2,2,
2,2,3,2,2,2,5,3,3,2,4,3,4,3,6,13,17),

and corresponding sample proportion of observed links given by
$\hat{p} = 78 / \binom{34}{2} = 78/561$.

Fig 4. Visualization of the Zachary karate data of Section 4.1. Nodes
are numbered and binary categorical covariate values, reflecting the
subsequent group split, are indicated by shape.

Table 1
Contingency table specifying counts of intra- and inter-subgroup links
for the data of Section 4.1

Counts                  # Links   # No Links   Total
Intra-subgroup: 0-0        33         87        120
Inter-subgroup: 0-1        10        278        288
Intra-subgroup: 1-1        35        118        153
Total                      78        483        561

Sociologists have interpreted the data of Zachary not only as evidence
of network structure in this karate club, but also as providing binary
categorical covariate values through an indication of the subsequent
split into two groups, as per Fig. 4. This in turn provides us with an
opportunity to test various models of network structure, including
those introduced in Section 3, with respect to ground truth.

4.2 Tests with Known Categorical Covariates

We begin by posing the question of whether or not the most basic
Erdős-Rényi network model of Definition 1, with each node being equally
likely to connect to any other node, serves as a good description of
the data, given the categorical variable of observed group membership.
The classical evaluation of this hypothesis comes via a contingency
table test.

Example 4.1 (Contingency Table Test). Consider the data of Section 4.1.
When categorical covariates are known, a contingency table test for
independence between rows and columns may be performed according to the
data shown in Table 1. The Pearson test statistic $T_{\chi^2}$ in this
case evaluates to over 47, and with only 2 degrees of freedom, the
corresponding p-value for these data is less than 10~^.

In this case, the null hypothesis (that the Erdős-Rényi model's sole
Bernoulli parameter can be used to describe both inter- and
intra-subgroup connection probabilities) can clearly be rejected.
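
A Pearson statistic for the counts of Table 1 can be computed in a few lines of standard-library Python. The sketch below sums over all six cells of the table, which is one common convention and may differ in detail from the computation behind the figure quoted in Example 4.1; the function name is hypothetical. The closed-form tail probability used at the end holds only for 2 degrees of freedom.

```python
import math

# Rows of Table 1: (# links, # no links) for blocks 0-0, 0-1 and 1-1.
observed = [(33, 87), (10, 278), (35, 118)]


def pearson_chi_square(table):
    """Pearson statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(table):
        for k, obs in enumerate(row):
            expected = row_totals[r] * col_totals[k] / total
            stat += (obs - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


stat, dof = pearson_chi_square(observed)
# For a chi-square variable with 2 degrees of freedom, P(X > x) = exp(-x/2).
p_value = math.exp(-stat / 2)
print(stat, dof, p_value)
```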



As in the case of Zheng et al. (2006) and others, this approach has
been generally used to reject an Erdős-Rényi null when given network
data include a categorical covariate for each node. (A cautionary
reminder is in order: employing this method when covariates are
inferred from data corresponds to a misuse of maximally selected
statistics (Altman et al., 1994).) Of course, in cases where it is
computationally feasible, we may instead use simulation to determine
the exact distribution of any chosen test statistic T under whichever
null model is assumed.

4.3 The Case of Latent Categorical Covariates

The Erdős-Rényi model of Definition 1 clearly implies a lack of network
structure through its nodal exchangeability properties, thus supporting
its use as a null model in cases such as Example 4.1 and those
described above. In contrast, the partial exchangeability exhibited by
the stochastic block model of Definition 2 suggests its use as an
alternate model that explicitly exhibits network structure. To this
end, the usual Neyman-Pearson logic implies the adoption of a
generalized likelihood ratio test statistic:

$$T_{\mathrm{LR}} = \frac{\sup_{p} \prod_{i>j} \ell(A_{ij}; p)}{\max_{c} \sup_{p_{00},p_{01},p_{11}} \prod_{i>j} \ell(A_{ij}; p_{00}, p_{01}, p_{11}, c(i), c(j))} = \frac{\prod_{i>j} \hat{p}^{A_{ij}} (1-\hat{p})^{1-A_{ij}}}{\max_{c} \sup_{p_{00},p_{01},p_{11}} \prod_{i>j} (p_{c(i)c(j)})^{A_{ij}} (1-p_{c(i)c(j)})^{1-A_{ij}}}.$$

As we have seen in Section 3.2, however, maximizing the likelihood over
the covariate vector $c \in \{0, 1\}^{n}$ in general requires an
exhaustive search. Faced with the necessity of approximate inference,
we recall that the spectral partitioning algorithms outlined earlier in
Section 3.2 provide an alternative to exact likelihood maximization in
$c$. The resultant test statistic $\hat{T}_{\mathrm{LR}}$ is
computationally feasible, though with reduced power, and to this end we
may test the data of Section 4.1 as follows.

Example 4.2 (Generalized Likelihood Ratio Test). Let $T_{\mathrm{LR}}$
be the test statistic associated with a generalized likelihood ratio
test of Erdős-Rényi versus a two-group stochastic block model, and let
$\hat{T}_{\mathrm{LR}}$ correspond to an approximation obtained by
spectral partitioning in place of the maximization over group
membership. For the data of Section 4.1, simulation yields a
corresponding p-value of less than 10~^ with respect to
$\hat{T}_{\mathrm{LR}}$, with Fig. 5 confirming the power of this test.

Fig 5. Receiver operating characteristic (ROC) curves corresponding to
tests of the data of Section 4.1, with Erdős-Rényi null and two-group
stochastic block model alternate, plotted against the false positive
rate $\alpha$. Test statistics $\hat{T}_{\mathrm{LR}}$ and
$T_{\mathrm{Var}}$ were calculated via simulation, with the ROC upper
bound obtained using knowledge of the true group membership for each
node.

Our case study has so far yielded reassuring results. However, a closer
look reveals that selecting appropriate network models and test
statistics may require more careful consideration.

Example 4.3 (Degree Variance Test). Suppose we adopt instead the test
statistic of Snijders (1981):

$$T_{\mathrm{Var}} = \frac{1}{n-1} \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{n} A_{ij} - \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} \Bigr)^{2},$$

the sample variance of the observed degree sequence
$\sum_{j=1}^{n} A_{ij}$. A glance at the data of Section 4.1 indicates
the poor fit of an Erdős-Rényi null, and indeed simulation yields a
p-value of less than 10~^. Figure 5, however, reveals that
$T_{\mathrm{Var}}$ possesses very little power.
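
The degree variance test of Example 4.3 lends itself to a short Monte Carlo sketch. The degree sequence is the one listed in Section 4.1; the use of the n-1 normalization, the number of simulations, the seed, and the function names are assumptions made for illustration, and the simulated p-value is bounded below by 1/(number of simulations + 1).

```python
import numpy as np

# Degree sequence of the Zachary data, as listed in Section 4.1.
zachary_degrees = np.array([
    16, 9, 10, 6, 3, 4, 4, 4, 5, 2, 3, 1, 2, 5, 2, 2, 2,
    2, 2, 3, 2, 2, 2, 5, 3, 3, 2, 4, 3, 4, 3, 6, 13, 17])


def degree_variance(degrees):
    """Sample variance of a degree sequence (n - 1 normalization assumed)."""
    return np.var(degrees, ddof=1)


def simulate_null(n, p, n_sims, rng):
    """Degree variances of graphs drawn from the Erdős-Rényi null."""
    stats = np.empty(n_sims)
    for s in range(n_sims):
        upper = np.triu(rng.random((n, n)) < p, k=1).astype(int)
        A = upper + upper.T
        stats[s] = degree_variance(A.sum(axis=1))
    return stats


n, links = 34, 78
p_hat = links / (n * (n - 1) / 2)  # 78/561, the fitted null parameter
rng = np.random.default_rng(0)
null_stats = simulate_null(n, p_hat, 1000, rng)
t_obs = degree_variance(zachary_degrees)
# Add-one Monte Carlo p-value: share of null draws at least as extreme.
p_value = (1 + np.sum(null_stats >= t_obs)) / (1 + len(null_stats))
print(t_obs, null_stats.mean(), p_value)
```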

This dichotomy between a low p-value, and yet 
low test power, highlights a limitation of the models 
exhibited thus far: in each case, both the expected 
degree sequence and the corresponding node con- 
nectivity properties are determined by exactly the 
same set of model parameters. In this regard, test 
statistics depending on the data set only through 
its degree sequence can prove quite limiting, as the 
difference between the two models under consider- 
ation lies entirely in their node connectivity prop- 
erties, rather than the heterogeneity of their degree 
sequences. 

Indeed, significant degree variation is a hallmark of many observed
network data sets, the data of Section 4.1 included; sometimes certain
nodes are simply more connected than others. In order to conclude that
rejection of a null model necessarily implies the presence of network
structure expressed through categorical covariates, a means of allowing
for heterogeneous degree sequences must be incorporated into the null
as well as the alternate.
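
For a fixed candidate assignment c, the inner supremum in the generalized likelihood ratio of this section is attained at the block-wise sample proportions, so the log-ratio depends on the data only through link counts per block. The sketch below evaluates it for the recorded split of Table 1; it does not carry out the maximization over c or the spectral approximation, and the function names are hypothetical.

```python
import math


def bernoulli_loglik(links, non_links):
    """Maximized Bernoulli log-likelihood for a block of pairwise trials."""
    p = links / (links + non_links)
    ll = 0.0
    if links:
        ll += links * math.log(p)
    if non_links:
        ll += non_links * math.log(1 - p)
    return ll


def log_glr(blocks):
    """Log of (single-parameter likelihood) / (block likelihood) for fixed c."""
    null = bernoulli_loglik(sum(l for l, _ in blocks),
                            sum(m for _, m in blocks))
    alternate = sum(bernoulli_loglik(l, m) for l, m in blocks)
    return null - alternate


# (# links, # no links) for blocks 0-0, 0-1 and 1-1, taken from Table 1.
table1 = [(33, 87), (10, 278), (35, 118)]
log_t = log_glr(table1)
print(log_t)
```

Values far below zero indicate that the block model fits the observed links much better than a single Bernoulli parameter.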

4.4 Decoupling Degree Sequence & Connectivity 

An obvious way to decouple properties of the de- 
gree sequence from those of connectivity is to restrict 
the model space to only those networks exhibiting 
the observed degree sequence. However, simulation 
of such graphs becomes nontrivial when they are re- 
stricted to be simple (i.e., without multiple edges 
or self-loops), thus rendering the test calculations
of Section 4.2 more difficult to achieve in practice.
Correspondingly, such fixed-degree models have
remained largely absent from the literature to date.
Recent advances in graph simulation methods,
however, help to overcome this barrier (Viger and
Latapy, 2005; Blitzstein and Diaconis, 2006). The
importance sampling approach of Blitzstein and
Diaconis (2006) enables us here to test the data set of
Section 4.1 using fixed-degree models that match its
observed degree sequence. Although the corresponding
normalizing constants cannot be computed in
closed form, we may specify a proposal distribution,
draw samples, and calculate unnormalized importance
weights.
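The recipe just described (specify a proposal, draw samples, weight them) can be illustrated with a minimal self-normalized importance sampling sketch. It is not the Blitzstein and Diaconis sampler: a toy finite state space stands in for the set of fixed-degree simple graphs, and the names `self_normalized_is`, `target_weight`, and `proposal_probs` are hypothetical, introduced only for illustration.

```python
import random

def self_normalized_is(h, target_weight, proposal_probs, n_samples, seed=0):
    """Estimate E_target[h(X)] when the target is known only up to a constant.

    target_weight(x): unnormalized target mass of state x (assumed computable).
    proposal_probs:   dict mapping each state x to its proposal probability q(x).
    """
    rng = random.Random(seed)
    states = list(proposal_probs)
    probs = [proposal_probs[s] for s in states]
    draws = rng.choices(states, weights=probs, k=n_samples)
    # Unnormalized importance weights w(x) = target_weight(x) / q(x).
    weights = [target_weight(x) / proposal_probs[x] for x in draws]
    # Dividing by the sum of weights removes the unknown normalizing constant.
    return sum(w * h(x) for w, x in zip(weights, draws)) / sum(weights)

# Toy usage: uniform target on {0, 1, 2, 3}, sampled through a skewed proposal.
proposal = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
estimate = self_normalized_is(lambda x: x, lambda x: 1.0, proposal, 20000)
print(estimate)
```

Because the weights enter only through a ratio, the normalizing constant of the target never needs to be evaluated, which is the property exploited in the fixed-degree setting.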

Example 4.4 (Fixed-Degree Test). Consider 
the set of all simple graphs featuring an observed de- 
gree sequence, and define a null model under which 
each of these graphs is equally likely. As an alternate 
model, let each graph be weighted in proportion to 
its likelihood under the two-group stochastic block 
model of Definition 2; in this case the normalizing
constant will depend on parameters $p_{00}$, $p_{01}$, and
$p_{11}$. The corresponding fixed-degree generalized
likelihood ratio test statistic $T_{\text{LR-FD}}$ is given in
analogy to Example 4.2 by

\[
T_{\text{LR-FD}} = \frac{1}{\max_{c} \, \sup_{p_{00}, p_{01}, p_{11}} \prod_{i<j} P\bigl(A_{ij};\, p_{00}, p_{01}, p_{11}, c(i), c(j)\bigr)}.
\]

Just as before, calculation of $T_{\text{LR-FD}}$ requires
a combinatorial search over group assignments $c$;
moreover, the fixed-degree constraint precludes an
analytical sup operation over parameters $p_{00}$, $p_{01}$,
and $p_{11}$. We therefore define an approximation
$\hat{T}_{\text{LR-FD}}$ employing spectral partitioning in place
of the maximization over group membership, and
substituting the analytical sup under two-group
stochastic block likelihood for the exact sup operation.
The substantial power of this test for the data
of Section 4.1 is visible in Fig. 6; the estimated
p-value of this data set remains below 10~^.

[Plot: horizontal axis False Positive Rate $\alpha$; curves: ROC Upper Bound, Test Statistic $\hat{T}_{\text{LR-FD}}$.]

Fig 6. ROC curve of $\hat{T}_{\text{LR-FD}}$ using fixed-degree models for
both the null and alternate hypotheses. (The stepped appearance
of the curve is an artifact of the importance sampling
weights.) Also shown is an ROC upper bound, obtained using
knowledge of the true group membership for each node.



Note that specification of parameters $p_{00}$, $p_{01}$, and
$p_{11}$ was required to generate Fig. 6 via simulation;
here, we manually fit these three parameters to the 
data, starting with their estimates under the two- 
group stochastic block model, until the likelihood of 
the observed data approached the median likelihood 
under our parameterization. A more formal fitting 
procedure could of course be adopted in practice. 
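The approximation used in Example 4.4 substitutes the analytical sup under the two-group stochastic block likelihood for the exact fixed-degree sup. A minimal sketch of that analytical step follows, assuming group labels are already fixed (for instance by a spectral partition) and ignoring the fixed-degree normalizing constant entirely; the function name `block_loglik_sup` is hypothetical.

```python
import math

def block_loglik_sup(adj, labels):
    """Log-likelihood of a simple undirected graph under a two-group
    stochastic block model, maximized analytically over p00, p01, p11
    for FIXED group labels (each entry of `labels` is 0 or 1)."""
    n = len(adj)
    edges = {(0, 0): 0, (0, 1): 0, (1, 1): 0}
    pairs = {(0, 0): 0, (0, 1): 0, (1, 1): 0}
    for i in range(n):
        for j in range(i + 1, n):
            key = tuple(sorted((labels[i], labels[j])))
            pairs[key] += 1
            edges[key] += adj[i][j]
    loglik = 0.0
    for key in pairs:
        m, e = pairs[key], edges[key]
        if m == 0 or e == 0 or e == m:
            continue  # fitted Bernoulli is degenerate; contributes log(1) = 0
        p = e / m  # maximum likelihood estimate for this block
        loglik += e * math.log(p) + (m - e) * math.log(1 - p)
    return loglik
```

With labels fixed, each block probability is estimated by the observed fraction of linked pairs in that block, so the sup is available in closed form; the combinatorial difficulty lies entirely in the choice of labels.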

5. OPEN PROBLEMS IN NETWORK INFERENCE

The examples of Sections 3 and 4 were designed to
be illustrative, and yet they also serve to illuminate
broader questions that arise as we seek to extend
classical notions of statistics to network data. As we
have seen in Section 3, for instance, the inclusion
of latent k-ary categorical covariates immediately
necessitates a variety of combinatorial calculations.
The increasing prevalence of large, complex network
data sets presents an even more significant
computational challenge for statistical inference. Indeed,
longstanding inferential frameworks (as exemplified
by the hypothesis tests of Section 4, for instance)
are crucial to the analysis of networks and relational
data, and yet their implementations can prove
remarkably difficult even for small data sets.

To address these broader questions and impact 
the future of network inference, we believe that 
statisticians should focus on the following three 
main categories of open problems, whose descriptions
comprise the remainder of this section:

1. We must work to specify models that can more 
realistically describe observed network data. 
For instance, the fixed-degree models introduced
earlier account explicitly for heterogeneous degree
sequences; in the case of large-scale network data
sets, even more flexible models are needed.

2. We must build approximations to these models 
for which likelihood maximization can be read- 
ily achieved, along with tools to evaluate the 
quality of these approximations. The spectral 
partitioning approach featured in our examples 
of Section 4 serves as a prime example; however,
validation of approximate inference procedures 
remains an important open question. 

3. We must seek to understand precisely how net- 
work sampling influences our statistical analy- 
ses. In addition to better accounting for data 
gathering mechanisms, sampling can serve as a 
method of data reduction. This in turn will en- 
able the application of a variety of methods to 
data sets much larger than those exhibited here. 

5.1 Model Elicitation and Selection 

More realistic network models can only serve
to benefit statistical inference, regardless of their
computational or mathematical convenience (Banks
and Constantine 1998). Models tailored to different
fields, and based on theory fundamental to specific
application areas, are of course the long-term
goal, with the exponential random graph models
reviewed by Anderson et al. (1999) and Snijders
et al. (2006) among the most successful and widely
known to date. However, additional work to
determine more general models for network structure
will also serve to benefit researchers and practitioners
alike. As detailed in the Appendix, there are
presently several competing models of this type, each
with its own merits: stochastic block models (Wang
and Wong 1987), block models with mixed membership
(Airoldi et al. 2007), and structural models
that explicitly incorporate information regarding
the degree sequence in addition to group membership
(Chung et al. 2003).

At present, researchers lack a clearly articulated
strategy for selecting between these different
modeling approaches. The goodness-of-fit procedures
of Hunter et al. (2008), based on graphical
comparisons of various network statistics, provide
a starting point, but comparing the complexity of
these different modeling strategies poses a challenge.
Indeed, it is not even entirely clear how best to
select the number of groups used in a single modeling
strategy alone. For the data of Section 4.1, for
example, we restricted our definition of network structure
to be a binary division of the data into two groups,
whereas many observed data sets may cluster into
an a priori unknown number of groups.

It is also worth noting that many different fields 
of mathematics may provide a source for network 
data models. While graph theory forms a natural 
starting point, other approaches based on a combi- 
nation of random matrices, algebra, and geometry 
may also prove useful. For example, the many graph 
partitioning algorithms based on spectral methods 
suggest the use of corresponding generative mod- 
els based on the eigenvectors and eigenvalues of the 
graph Laplacian. The primary challenge in this case 
appears to be connecting such models to the ob- 
served data matrix A, which typically consists of 
binary entries. 
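To make the role of the graph Laplacian concrete, the following sketch bisects a connected graph by the sign pattern of the Laplacian eigenvector with the smallest nonzero eigenvalue. It is an illustration only: the eigenvector is approximated by plain power iteration on a shifted Laplacian rather than by a production eigensolver, the graph is assumed connected with at least two nodes, and the function name `fiedler_bisection` is hypothetical.

```python
def fiedler_bisection(adj, n_iter=500):
    """Split a connected undirected graph into two groups using the sign of
    the Laplacian eigenvector with the smallest nonzero eigenvalue.

    adj is a symmetric 0/1 adjacency matrix given as a list of lists.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    shift = 2 * max(deg)  # upper bound on the largest Laplacian eigenvalue
    x = [float(i) for i in range(n)]  # deterministic starting vector
    for _ in range(n_iter):
        mean = sum(x) / n
        x = [v - mean for v in x]  # project out the constant eigenvector
        # y = (shift * I - L) x, where L = D - A is the graph Laplacian.
        y = [(shift - deg[i]) * x[i] + sum(adj[i][j] * x[j] for j in range(n))
             for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return [0 if v < 0 else 1 for v in x]
```

The output is a vector of binary labels, whereas the eigenvector itself is real-valued; this gap between continuous spectral quantities and the binary data matrix A is the modeling difficulty noted above.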

5.2 Approximate Inference and Validation 

Computationally or mathematically convenient 
models will also continue to play a key role in net- 
work analysis. Even simple generic models of struc- 
ture are very high-dimensional, and with network 
data sets commonly consisting of thousands to mil- 
lions of nodes, model dimensionality spirals out of 
control at an impossible rate. Somehow this funda- 
mental challenge of network data — how to grapple 
with the sheer number of relational observations — 
must be turned into a strength so that further 
analysis may proceed. Reducing the dimensionality 
through an approximate clustering is an excellent
first step to build upon, but computationally
realizable inference schemes must also follow in turn.

The usefulness of such approximations will ulti- 
mately be determined by the extent to which evalu- 
ation tools can be developed for massive data sets. 
Whenever models are sufficiently complex to necessi- 
tate approximate inference procedures, such models 
must be paired with mechanisms to relate the qual- 
ity of the resulting analysis back to the original prob- 
lem and model specification. Indeed, assurances are 
needed to convince thoughtful practitioners that
analyzing a different model, or maximizing a quantity
other than the desired likelihood, is a useful exercise.

Other approaches to validation may focus on the 
outcome of the analysis in some way, rather than its 
theoretical underpinnings. With ground truth by its 
very definition available only for small-scale illustra- 
tive problems, or for those which are generally felt 
to have already been solved, prediction may provide 
a valuable substitute. Monitoring the results of
approximation over time relative to revealed truth
can build confidence in the adopted inference
procedure.

5.3 Sampling, Missingness, and Data Reduction 

A final concern is to better understand how sampling
mechanisms influence network inference. Consider
that two critical assumptions underpin the vast
majority of contemporary network analyses: first, that
all links within the collection of observed nodes have
been accounted for; and second, that observed nodes
within the network comprise the only nodes of interest.
In general, neither of these assumptions need hold
in practice.

To better understand the pitfalls of the first as- 
sumption, consider that while observing the pres- 
ence of a link between nodes is typically a feasible 
and well defined task, observing the absence of a 
link can in many cases pose a substantial challenge. 
Indeed, careful reflection often reveals that zero
entries in relational data sets are better thought
of as unobserved (Clauset et al. 2008; Marchette
and Priebe 2008). The implications of this fact for
subsequent analysis procedures, as well as for
approximate likelihood maximization procedures and
spectral methods in particular, remain unclear.

The second assumption, that all nodes of interest
have in fact been recorded, also appears rarely
justified in practice. Indeed, it seems an artifact of
this assumption that most commonly studied data 
sets consist of nodes which form a connected net- 
work. While in some cases the actual network may 
in fact consist of a single connected component, re- 
searchers may have unwittingly selected their data 
conditioned upon its appearance in the largest con- 
nected component of a much larger network. How 
this selection in turn may bias the subsequent fitting
of models has only recently begun to be investigated
(Handcock and Gile, 2009).



A better understanding of missingness may also 
lend insight into optimal sampling procedures. Al- 
though researchers themselves may lack influence 
over data gathering mechanisms, the potential of 
such methods for data reduction is clear. One 
particularly appealing approach is to first sample 
very large network data sets in a controlled man- 
ner, and then apply exact analysis techniques. In 
some cases the resultant approximation error can be
bounded (Belabbas and Wolfe, 2009), implying that
the effects on the inferential procedures in question
can be quantified.

Other data reduction techniques may also help to 
meet the computational challenges of network
analysis; for example, Krishnamurthy et al. (2007)
examined contractions of nodes into groups as a means of
lessening data volume. Such strategies of reducing 
network size while preserving relevant information 
provide an alternative to approximate likelihood 
maximization that is deserving of further study. 

6. CONCLUSION 

In many respects, the questions being asked of 
network data sets are not at all new to statisticians. 
However, the increasing prevalence of large networks 
in contemporary application areas gives rise to both 
challenges and opportunities for statistical science. 
Tests for detecting network structure in turn form a 
key first step toward more sophisticated inferential 
procedures, and moreover provide practitioners with 
much-needed means of formal data analysis. 

Classical inferential frameworks are precisely 
what is most needed in practice, and yet as we 
have seen, their exact implementation can prove re- 
markably difficult in the setting of modern high- 
dimensional, non-Euclidean network data. To this 
end, we hope that this paper has succeeded in helping
to chart a path toward the ultimate goal of a
unified and coherent framework for the statistical
analysis of large-scale network data sets.

APPENDIX: A REVIEW OF APPROACHES TO
NETWORK DATA ANALYSIS

Three canonical problems in network data analy- 
sis have consistently drawn attention across different 
contexts: network model elicitation, network model 
inference, and methods of approximate inference. 

A.1 Model Elicitation

With new network data sets being generated 
or discovered at rapid rates in a wide variety of 
fields, model elicitation — independent even of model 
selection — remains an important topic of investi- 
gation. Although graph theory provides a natural 
starting point for identifying possible models for 
graph-valued data, practitioners have consistently
found that models such as Erdős-Rényi lack sufficient
explanatory power for complex data sets. The
inability of this model to capture any but the simplest
of degree distributions has forced researchers to seek
out more complicated models.

Barabási (2002) and Palla et al. (2005) survey
a wide variety of network data sets and conclude
that commonly encountered degree sequences follow
a power law or similarly heavy-tailed distribution;
the Erdős-Rényi model, with its marginally binomial
degree distribution, is obviously insufficient to
describe such data sets. Barabási and Albert (1999)
introduced an alternative by way of a generative
network modeling scheme termed "preferential
attachment" to explicitly describe power-law degree
sequences. Under this scheme, nodes are added
sequentially to the graph, being preferentially linked to
existing nodes based on the current degree sequence.
A moment's reflection will convince the reader that
this model is in fact an example of a Dirichlet
process (Pemantle 2007).

Though the preferential attachment approach
serves to describe the observed degree sequences of
many networks, it can fall short of correctly modeling
their patterns of connectivity (Li et al. 2005);
moreover, heterogeneous degree sequences may not
necessarily follow power laws. A natural solution to
both problems is to condition on the observed degree
sequence as in Section 4.4 and consider the connections
between nodes to be random. As described earlier,
the difficulties associated with simulating fixed-degree
simple graphs have historically dissuaded researchers
from this direction, and hence fixed-degree
models have not yet seen wide use in practice.

As an alternative to fixed-degree models, researchers
have instead focused on the so-called configuration
model (Newman et al. 2001) as well as
models which yield graphs of given expected degree
(Chung et al. 2003). The configuration model
specifies the degree sequence exactly, as with the
case of fixed-degree models, but allows both multi- 
ple links between nodes and "self-loops" in order to 
gain a simplified asymptotic analysis. Models featur- 
ing given expected degrees specify only the expected 
degree of each node — typically set equal to the ob- 
served degree — and allow the degree sequence to be 
random. Direct simulation becomes possible if self- 
loops and multiple links are allowed, thus enabling 
approximate inference methods of the type described 
in Section 3. However, observed network data sets
do not always exhibit either of these phenomena, 
thus rendering the inferential utility of these mod- 
els highly dependent on context. In the case of very 
large data sets, for example, the possible presence or 
absence of multiple connections or self-loops in the 
model may be irrelevant to describing the data on 
a coarse scale. When it becomes necessary to model 
network data at a fine scale, however, a model which 
allows for these may be insufficiently realistic. 
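The direct simulation referred to above can be sketched by random pairing of half-edges ("stubs"), one common way to realize the configuration model. The sketch below is illustrative rather than a reference implementation; the function names `configuration_model` and `count_defects` are hypothetical, and a self-loop is counted as contributing two to the degree of its node.

```python
import random

def configuration_model(degrees, seed=0):
    """Draw a multigraph with exactly the given degree sequence by pairing
    half-edges uniformly at random; self-loops and multiple links may occur."""
    if sum(degrees) % 2 != 0:
        raise ValueError("degree sequence must have an even sum")
    rng = random.Random(seed)
    # One stub per unit of degree, labeled by its node.
    stubs = [node for node, d in enumerate(degrees) for _ in range(d)]
    rng.shuffle(stubs)
    # Consecutive stubs in the shuffled list are joined into links.
    return [(stubs[k], stubs[k + 1]) for k in range(0, len(stubs), 2)]

def count_defects(edges):
    """Return (number of self-loops, number of repeated links)."""
    loops = sum(1 for u, v in edges if u == v)
    seen, repeats = set(), 0
    for u, v in edges:
        if u == v:
            continue
        key = (min(u, v), max(u, v))
        if key in seen:
            repeats += 1
        seen.add(key)
    return loops, repeats

edges = configuration_model([3, 2, 2, 1, 1, 1], seed=1)
print(edges, count_defects(edges))
```

Counting the self-loops and repeated links in each draw gives a quick check of whether these artifacts matter at the scale of the data being modeled.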

Graph models may equally well be tailored to spe- 
cific fields. For example, sociologists and statisticians 
working in concert have developed a class of well- 
known approaches collectively known as exponential 
random graph models (ERGMs) or alternatively as 
p* models. Within this class of models, the probabil- 
ity of nodes being linked to each other depends ex- 
plicitly on parameters that control well-defined suf- 
ficient statistics; practitioners draw on sociological 
theory to determine which connectivity statistics are
critical to include within the model. A key advantage
of these models is that they can readily incorporate
covariates into their treatment of connectivity
properties. For a detailed review, along with a discussion
of some of the latest developments in the field of
social networks, the reader is referred to Anderson
et al. (1999) and Snijders et al. (2006).



Since their original introduction, ERGMs have 
been widely adopted as models for social networks. 
They have not yet, however, been embraced to the 
same extent by researchers outside of social network 
analysis. Sociologists can rely on existing theory to 
select models for how humans form relationships 
with each other; researchers in other fields, though, 
often cannot appeal to equivalent theories. For ex- 
ploratory analysis, they may require more generic 
models to describe their data, appearing to prefer 
models with a latent vector of covariates to capture
probabilistically exchangeable blocks. Indeed,
as noted in Section 3.1, this approach falls under the
general category of stochastic block modeling. Wang
and Wong (1987) detail similarities and differences
between this approach and the original specification
of ERGMs.



Stochastic block modeling, though relatively 
generic, may still fail to adequately describe net- 
works in which nodes roughly group together, yet in 
large part fail to separate into distinct clusters. In 
cases such as this, where stochastic exchangeability 
is too strong an assumption, standard block model- 
ing breaks down. To this end, two possible modeling 
solutions have been explored to date in the litera- 
ture. 



Hoff et al. (2002) introduced a latent space
approach, describing the probability of connection as
a function of distances between nodes in an
unobserved space of known dimensionality. In this model,
the observed grouping of nodes is a result of their
proximity in this latent space. In contrast, Airoldi
et al. (2007) retained the explicit grouping structure
that stochastic block modeling provides, but
introduced the idea of mixed group membership to 
describe nodes that fall between groups. Node mem- 
bership here is a vector describing partial member- 
ship in all groups, rather than an indicator variable 
specifying a single group membership. 
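The two modeling ideas above can be contrasted through their link probabilities. The sketch below assumes one common parameterization of each: a logistic function of an intercept minus Euclidean latent distance for the latent space approach, and a bilinear form in the membership vectors for mixed membership (the link probability after averaging over per-pair group indicators). The function names and the intercept `alpha` are illustrative assumptions, not the notation of the cited papers.

```python
import math

def latent_distance_prob(z_i, z_j, alpha):
    """Link probability that decreases with latent distance:
    logit P(link) = alpha - ||z_i - z_j|| (an assumed parameterization)."""
    dist = math.dist(z_i, z_j)
    return 1.0 / (1.0 + math.exp(-(alpha - dist)))

def mixed_membership_prob(pi_i, pi_j, block_probs):
    """Link probability pi_i^T B pi_j, where pi_i and pi_j are partial
    membership vectors summing to one and B holds between-group link
    probabilities."""
    return sum(pi_i[a] * block_probs[a][b] * pi_j[b]
               for a in range(len(pi_i)) for b in range(len(pi_j)))

B = [[0.9, 0.1], [0.1, 0.8]]
print(latent_distance_prob((0.0, 0.0), (1.0, 0.0), alpha=1.0))
print(mixed_membership_prob([0.5, 0.5], [0.5, 0.5], B))
```

When each membership vector is an indicator of a single group, the second function reduces to the ordinary stochastic block model probability for that pair of groups.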



A.2 Model Fitting and Inference

Even when a model or class of models for net- 
work data can be specified, realizing inference can 
be challenging. One of the oldest uses of random
graph models is as a null; predating the computer,
Moreno and Jennings (1938) simulated a random
graph model quite literally by hand in order to tabulate
null model statistics. These authors drew cards
out of a ballot shuffling apparatus to generate graphs 
of the same size as a social network of schoolgirls 
they had observed. Comparing the observed statis- 
tics to the distribution of tabulated statistics, they 
rejected the hypothesis that the friendships they 
were observing were formed strictly by chance. 
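The same logic is easily automated today. The sketch below is a Monte Carlo version of the procedure under assumptions that are ours rather than those of the original study: the null model draws graphs uniformly among those with the observed number of nodes and links, and the statistic is the triangle count, chosen purely for illustration. The function names are hypothetical.

```python
import itertools
import random

def count_triangles(n, edges):
    """Number of triangles in an undirected simple graph on nodes 0..n-1."""
    present = {tuple(sorted(e)) for e in edges}
    return sum(1 for a, b, c in itertools.combinations(range(n), 3)
               if (a, b) in present and (a, c) in present and (b, c) in present)

def monte_carlo_p_value(n, edges, n_sim=200, seed=0):
    """One-sided Monte Carlo p-value for the observed triangle count under
    a null of uniformly random graphs with the same number of links."""
    rng = random.Random(seed)
    all_pairs = list(itertools.combinations(range(n), 2))
    observed = count_triangles(n, edges)
    exceed = 0
    for _ in range(n_sim):
        simulated = rng.sample(all_pairs, len(edges))
        if count_triangles(n, simulated) >= observed:
            exceed += 1
    # Adding one to numerator and denominator keeps the test valid.
    return (1 + exceed) / (1 + n_sim)
```

The smallest attainable p-value is 1 / (1 + n_sim), so the number of simulated graphs bounds the strength of evidence that can be reported.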

Asymptotic tests may alleviate the need for sim- 
ulation in cases of large network data sets, and are 
available for certain models and test statistics, the
$\chi^2$-test of Section 4.2 being one such example. As
another example, Holland and Leinhardt (1981)
developed asymptotic tests based on likelihood ratio
statistics to select between different ERGMs.
Sociologists and statisticians together have developed
results for other test statistics as well, many of which
are reviewed by Wasserman and Faust (1994).



A desire upon the rejection of a null model, of
course, is the fitting of an alternate. However, as
demonstrated in Section 3.2, direct fitting by maximum
likelihood can prove computationally costly,
even for basic network models. A common solution
to maximizing the likelihood under an ERGM, for
example, is to employ a Markov chain Monte Carlo
strategy (Snijders et al. 2006). Handcock et al.
(2007) also used such methods to maximize the
likelihood of a latent space network model; additionally,
these authors suggested a faster, though approxi- 
mate, two-stage maximization routine. 

Other researchers have employed greedy algorithms
to maximize the model likelihood. Newman
and Leicht (2007) used expectation-maximization
(EM) to fit a network model related to stochastic
block modeling. Relaxing the precise requirements
of the EM algorithm, both Hofman and Wiggins
(2008) and Airoldi et al. (2008) have applied a
variational Bayes approach (see, e.g., Jordan et al.
(1999)) to find maximum likelihood estimates of
parameters under a stochastic block model. Reichardt
and Bornholdt (2004) applied simulated annealing
to maximize the likelihood of network data under
a Potts model, a generalization of the Ising model.
Rosvall and Bergstrom (2007, 2008) have also
employed simulated annealing in network inference in
order to maximize information-theoretic functionals
of the data.

Following any kind of model fitting procedure, a
goodness-of-fit test of some kind is clearly desirable.
Yet, researchers have thus far struggled to find a
clear solution to this problem. Hunter et al. (2008)
have proposed a general method of accumulating a
wide set of network statistics, and comparing them
graphically to the distribution of these same statistics
under a fitted model. Networks which fit well
should in turn exhibit few statistics that deviate far
from those simulated from the corresponding model.

A.3 Approximate Inference Procedures

In most cases of practical interest, and in particular
for large network data sets, model likelihoods
cannot be maximized in a computationally feasible
manner, and researchers must appeal to a heuristic
that yields some approximately maximized quantity.
With this goal in mind, the idea of likelihood
maximization has been subsumed by the idea of fast
graph partitioning described in Section 3.2, as it is
the process of determining group membership which
typically poses the most computational challenges.
The invention of new algorithms that can quickly
partition large graphs is clearly of great utility here.

A.3.1 Algorithmic Approaches. Computer scientists
and physicists have long been active in the
creation of new graph partitioning algorithms. In
addition to techniques such as spectral bisection,
many researchers have also noted that the inherently
sparse nature of most real-world adjacency
structures enables faster implementations of spectral
methods (see, e.g., White and Smyth (2005)).
Researchers have also sought to incorporate graph
partitioning concepts that allow for multiple partitions
of varying sizes. Some researchers, such as Eckmann
and Moses (2002) and Radicchi et al. (2004),
have attempted to use strictly local statistics to aid
in the clustering of nodes into multiple partitions.
Girvan and Newman (2002) focused in contrast on
global statistics, by way of measures of the centrality
of a node relative to the rest of the graph. This line
of reasoning eventually resulted in the introduction
of modularity (Newman 2006) as a global statistic
to relate the observed number of edges between
groups to their expected number under the
configuration model outlined above. Spectral
clustering methods can also be applied to the task of
approximately maximizing modularity, in a manner
that enables both group size and number to vary. A
wide variety of alternative maximization approaches
have been applied as well. Both Wang et al. (2007)
and Brandes et al. (2008) review the computational
difficulties associated with maximization of the
modularity statistic, and relate this to known
combinatorial optimization problems. Fortunato and
Castellano (2007) review many recently proposed
maximization routines and contrast them with traditional
methods.

A.3.2 Evaluation of Efficacy. Approximate procedures
in turn require some way to evaluate
the departure from exact likelihood maximization.
Thus far, a clear way to evaluate partitions found
through the various heuristics cited above has not
yet emerged, though many different approaches
have been proposed. Both Massen and Doye (2006)
and Karrer et al. (2008) have explored ways to test
the statistical significance of the output of graph
partitioning algorithms. Their methods attempt to
determine whether a model which lacks structure
could equally well explain the group structure
inferred from the data. These approaches, though
distinct from one another, are both akin to performing
a permutation test, a method known to be effective
when applied to more general cases of clustering.
Carley and Banks (1993) apply this exact idea
to test for structure when group memberships are
given.

Other researchers have attempted a more empirical
approach to the problem of partition evaluation
by adopting a metric to measure the distance
between found and "true" partitions. Such distances
are then examined for a variety of data sets and
simulated cases for which the true partition is assumed
known. In this vein, Danon et al. (2005) specified
an explicit probability model for structure and
compared how well different graph partitioning schemes
recovered the true subgroups of data, ranking them
by both execution time and average distance
between true and found partitions. Gustafsson et al.
(2006) performed a similar comparison, along with a
study of differences in "found" partitions between
algorithms for several well-known data sets, including
the karate club data of Section 4.1. They found that
standard clustering algorithms (e.g., k-means)
sometimes outperform more specialized network
partition algorithms. Finally, Fortunato and Barthélemy
(2007) have undertaken theoretical investigations of
the sensitivity and power of a particular partitioning
algorithm to detect subgroups below a certain size.
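For concreteness, the modularity statistic of Section A.3.1, which underlies several of the partitioning heuristics evaluated above, can be written for an undirected simple graph as Q = (1 / 2m) Σ_ij [A_ij − k_i k_j / (2m)] δ(c_i, c_j), where k_i is the degree of node i and m the number of links. The sketch below evaluates this quantity directly for a given partition; the function name `modularity` and the list-of-lists adjacency representation are illustrative choices.

```python
def modularity(adj, labels):
    """Modularity of a partition: observed within-group links minus their
    expected number under the configuration model, normalized by 2m."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    two_m = sum(deg)  # twice the number of links
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += adj[i][j] - deg[i] * deg[j] / two_m
    return q / two_m
```

Placing every node in a single group yields a modularity of zero, so positive values indicate more within-group links than the degree sequence alone would suggest.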